Can deep learning help you find the perfect match?
Is he/she my type or not? The answer to this question depends on the personal
preferences of the one asking it. The individual process of obtaining a full
answer may generally be difficult and time consuming, but often an approximate
answer can be obtained simply by looking at a photo of the potential match.
Such approximate answers based on visual cues can be produced in a fraction of
a second, a phenomenon that has led to a series of recently successful dating
apps in which users rate others positively or negatively using primarily a
single photo. In this paper we explore using convolutional networks to create a
model of an individual's personal preferences based on rated photos. This
introduced task is difficult due to the large number of variations in profile
pictures and the noise in attractiveness labels. Toward this task we collect a
dataset comprised of pictures and binary labels for each. We compare
performance of convolutional models trained in three ways: first directly on
the collected dataset, second with features transferred from a network trained
to predict gender, and third with features transferred from a network trained
on ImageNet. Our findings show that ImageNet features transfer best, producing
a model that attains accuracy on the test set and is moderately
successful at predicting matches
Deep learning and reinforcement learning methods for grounded goal-oriented dialogue
While dialogue systems have the potential to fundamentally change human-machine interaction, developing general chatbots with deep learning and reinforcement learning techniques has proven difficult. One challenging aspect is that these systems are expected to operate in broad application domains for which there is no clear measure of evaluation. This thesis investigates goal-oriented dialogue tasks in multi-modal environments because doing so (i) constrains the scope of the conversation, (ii) comes with a better-defined objective, and (iii) enables enriching language representations by grounding them in perceptual experiences. More specifically, we develop GuessWhat?!, an image-based guessing game in which two agents cooperate to locate an unknown object by asking a sequence of questions. For the subtask of visual question answering, we propose Conditional Batch Normalization layers as a simple but effective conditioning method that adapts the convolutional activations to the specific question at hand. Finally, we investigate the difficulty of dialogue-based navigation by introducing Talk The Walk, a new task where two agents (a “tourist” and a “guide”) collaborate to have the tourist navigate to target locations in the virtual streets of New York City
Learning Visual Reasoning Without Strong Priors
Achieving artificial visual reasoning - the ability to answer image-related
questions which require a multi-step, high-level process - is an important step
towards artificial general intelligence. This multi-modal task requires
learning a question-dependent, structured reasoning process over images from
language. Standard deep learning approaches tend to exploit biases in the data
rather than learn this underlying structure, while leading methods learn to
visually reason successfully but are hand-crafted for reasoning. We show that a
general-purpose, Conditional Batch Normalization approach achieves
state-of-the-art results on the CLEVR Visual Reasoning benchmark with a 2.4%
error rate. We outperform the next best end-to-end method (4.5%) and even
methods that use extra supervision (3.1%). We probe our model to shed light on
how it reasons, showing it has learned a question-dependent, multi-step
process. Previous work has operated under the assumption that visual reasoning
calls for a specialized architecture, but we show that a general architecture
with proper conditioning can learn to visually reason effectively.
Comment: Full AAAI 2018 paper is at arXiv:1709.07871. Presented at ICML 2017's
Machine Learning in Speech and Language Processing Workshop. Code is at
http://github.com/ethanjperez/fil
FiLM: Visual Reasoning with a General Conditioning Layer
We introduce a general-purpose conditioning method for neural networks called
FiLM: Feature-wise Linear Modulation. FiLM layers influence neural network
computation via a simple, feature-wise affine transformation based on
conditioning information. We show that FiLM layers are highly effective for
visual reasoning - answering image-related questions which require a
multi-step, high-level process - a task which has proven difficult for standard
deep learning methods that do not explicitly model reasoning. Specifically, we
show on visual reasoning tasks that FiLM layers 1) halve state-of-the-art error
for the CLEVR benchmark, 2) modulate features in a coherent manner, 3) are
robust to ablations and architectural modifications, and 4) generalize well to
challenging, new data from few examples or even zero-shot.
Comment: AAAI 2018. Code available at http://github.com/ethanjperez/film .
Extends arXiv:1707.0301
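The feature-wise affine transformation described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the helper names `linear` and `film`, the use of a single affine map as the generator of the scale and shift values, and the plain nested-list tensors are all assumptions made for illustration. The only thing taken from the abstract is the idea that each feature map is scaled and shifted by values computed from the conditioning information.

```python
from typing import List

Vector = List[float]
FeatureMap = List[List[float]]  # one channel, laid out as H x W


def linear(weights: List[Vector], bias: Vector, z: Vector) -> Vector:
    """Affine map W z + b; a hypothetical stand-in for the network that
    turns conditioning information z into per-channel parameters."""
    return [sum(w * v for w, v in zip(row, z)) + b
            for row, b in zip(weights, bias)]


def film(features: List[FeatureMap], gamma: Vector, beta: Vector) -> List[FeatureMap]:
    """Feature-wise affine transformation: every value in channel c is
    mapped to gamma[c] * value + beta[c]."""
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("need one gamma and one beta per channel")
    return [
        [[g * value + b for value in row] for row in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


# Toy usage: two 2x2 feature maps modulated by a 2-d conditioning vector.
features = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [0.5, 0.5]]]
z = [1.0, -1.0]
gamma = linear([[1.0, 0.0], [0.0, 2.0]], [0.0, 0.0], z)
beta = linear([[0.0, 1.0], [1.0, 1.0]], [0.5, 0.0], z)
modulated = film(features, gamma, beta)
```

In this toy setup the same conditioning vector produces a different scale and shift for each channel, which is what lets the conditioning input amplify, suppress, or negate individual feature maps.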
End-to-end optimization of goal-driven and visually grounded dialogue systems
End-to-end design of dialogue systems has recently become a popular research
topic thanks to powerful tools such as encoder-decoder architectures for
sequence-to-sequence learning. Yet, most current approaches cast human-machine
dialogue management as a supervised learning problem, aiming at predicting the
next utterance of a participant given the full history of the dialogue. This
vision is too simplistic to render the intrinsic planning problem inherent to
dialogue as well as its grounded nature, making the context of a dialogue
larger than the sole history. This is why only chit-chat and question answering
tasks have been addressed so far using end-to-end architectures. In this paper,
we introduce a Deep Reinforcement Learning method to optimize visually grounded
task-oriented dialogues, based on the policy gradient algorithm. This approach
is tested on a dataset of 120k dialogues collected through Mechanical Turk and
provides encouraging results at solving both the problem of generating natural
dialogues and the task of discovering a specific object in a complex picture
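To make the policy gradient idea mentioned in this abstract concrete, here is a minimal sketch of one REINFORCE-style update. It is a generic textbook form under simplifying assumptions, not the paper's method: the policy is a softmax over a handful of discrete actions rather than a dialogue model that generates questions, and the names `softmax`, `reinforce_step`, `baseline`, and `lr` are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits: List[float], action: int, reward: float,
                   baseline: float = 0.0, lr: float = 0.1) -> List[float]:
    """One gradient-ascent step on (reward - baseline) * log pi(action).

    For a softmax policy, the gradient of log pi(action) with respect to
    logit i is (1 if i == action else 0) - pi(i).
    """
    probs = softmax(logits)
    advantage = reward - baseline
    return [
        v + lr * advantage * ((1.0 if i == action else 0.0) - p)
        for i, (v, p) in enumerate(zip(logits, probs))
    ]


# Toy usage: the sampled action 0 earned reward 1, so its logit is raised.
updated = reinforce_step([0.0, 0.0], action=0, reward=1.0, lr=0.5)
```

An action that earns more reward than the baseline becomes more likely, and one that earns less becomes less likely. In a task-oriented dialogue setting the reward would plausibly come from whether the target object was found, but that mapping is an assumption here.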
GuessWhat?! Visual object discovery through multi-modal dialogue
We introduce GuessWhat?!, a two-player guessing game as a testbed for
research on the interplay of computer vision and dialogue systems. The goal of
the game is to locate an unknown object in a rich image scene by asking a
sequence of questions. Higher-level image understanding, like spatial reasoning
and language grounding, is required to solve the proposed task. Our key
contribution is the collection of a large-scale dataset consisting of 150K
human-played games with a total of 800K visual question-answer pairs on 66K
images. We explain our design decisions in collecting the dataset and introduce
the oracle and questioner tasks that are associated with the two players of the
game. We prototyped deep learning models to establish initial baselines of the
introduced tasks.
Comment: 23 pages; CVPR 2017 submission; see https://guesswhat.a